Aim: Detect aberrantly expressed genes and pathways in ATG7 patients

1 Background

1.1 OUTRIDER

  • OUTRIDER is a bioinformatics software that aims at detecting aberrantly expressed genes from RNA-seq data. It is specifically designed for a rare disease context through correcting for co-variations that are not known a-priori using an autoencoder. This is claimed to help distinguish outliers more readily. OUTRIDER then applies a negative binomial test to find outlier expressed genes.
  • Previous analysis using OUTRIDER has demonstrated that it's autoencoder correction looks to overcorrect the RNA-seq data when controls are obtained through a different sequencing protocol (GTEx vs in-house). This resulted in all patient samples having mitochondrial pathway genes enriched for dysruptions. However, upon testing an in-house sample without expected mitochondrial disease, the same enriched pathways were discovered. Overall, this suggests that OUTRIDER was entirely picking up on technical differences rather than biological.
  • Now, we have sequenced 5 (1 replicate) patients with ATG7 mutations and have ~50 in-house sequenced controls. Although these controls have differing batches and some, originate from different sequencing companies they have matching sequencing depth, length and library construction. If the technical differences look to be the promiment result still, we can narrow now the cohort to only those samples sequenced in the last batch (~20 samples).

1.2 DESeq2

  • DESeq2 models reads using a negative binomial distribution, similar to OUTRIDER. DESeq2 internally normalises gene counts for library size and sex + batch of samples were used as a covariate. A binomial test was used to determine the abnormality of each genes expression with the ATG7 patients, compared to the expression within the remaining patients.

2 Method

  1. Obtain a gene count matrix using GenomicAlignment::summarizedOverlaps via RNA-SEQC (matching the GTEx protocol).
  2. Generate patient sex via coverage across XIST & DDX3Y and mark batches.
  3. Run OUTRIDER on ATG7 samples with either all remaining patients (50) or only those of batch 2 (20):
    1. Filter genes by expression
    2. Find optimal encoding dimension
    3. Correct counts using autoencoder
    4. Obtain p-values using negative binomial test
  4. Run DESeq2
  5. Run gene-set enrichment analysis

3 Obtain gene count matrix

  • Foremost, we need to obtain the gene counts for patients. Here, we will follow the GTEx protocol as described here: https://gtexportal.org/home/documentationPage#AboutData. This involves using GENCODE v26 (which we will switch for Ensembl v97) and counting up counts across genes using RNA-SEQC.

3.1 Generate genes-only version of Ensembl gtf


cd /tools
git clone https://github.com/broadinstitute/gtex-pipeline
chmod -R 775 gtex-pipeline

# generate the genes gtf
/tools/gtex-pipeline/gene_model/collapse_annotation.py \
/data/references/ensembl/gtf_gff3/v97/Homo_sapiens.GRCh38.97.gtf \
/data/references/ensembl/gtf_gff3/v97/Homo_sapiens.GRCh38.97.genes.gtf

3.2 RNA-SEQC

  • This script generates a gene count matrix across all Ensembl genes using
source("/home/dzhang/projects/ATG7_rob_t_analysis/scripts/aberrant_expression/get_gene_count_RSE.R")

4 Patient metadata

load("/home/dzhang/projects/ATG7_rob_t_analysis/results/OUTRIDER/50_controls/gene_count_ods.rda")

colData(gene_count_ods) %>% 
  as.data.frame() %>% 
  as_tibble()

5 ATG7 expression across samples

  • Let's first have a quick check of the expression of ATG7 across samples with ATG7 mutations vs controls.
load("/home/dzhang/projects/ATG7_rob_t_analysis/results/OUTRIDER/50_controls/gene_count_ods.rda")

which_ATG7 <- which(colData(gene_count_ods)[["gene_name"]] == "ATG7")
which_control <-  which(str_detect(colData(gene_count_ods)[["samp_id_tidy"]], "control"))

ATG7_only_ods <- gene_count_ods[names(gene_count_ods) == "ENSG00000197548" , c(which_ATG7, which_control)]

assays(ATG7_only_ods)[["fpkm"]] %>% 
  as.data.frame() %>% 
  gather(key = "samp_id", value = "fpkm")  %>% 
  mutate(family = c(2, 2, 3, 4, 1, 1, "control", "control")) %>% 
  ggplot(aes(x = samp_id, y = fpkm, fill = as.character(family))) + 
  geom_col(colour = "black") + 
  scale_x_discrete(name = "Sample ID") + 
  scale_y_continuous(name = "FPKM") + 
  scale_fill_manual(name = "Family", 
                    values = ggpubr::get_palette("npg", 5)) +
  theme_bw()

  • Family 1 and 2 look to have significantly lower expression of the ATG7 gene.

6 OUTRIDER

  • We will run OUTRIDER either using all remaining 50 patients as controls or using the 20 (in the second batch). The latter analysis ensures that there will be no batch differences between cases/controls and matches the proposed usage of OUTRIDER in their paper.
  • I have masked samples that have the same pathogenic mutation when finding the optimal encoding dimension as OUTRIDER assumes "unrelated" samples. They also give this as an example in their workflow (https://www.bioconductor.org/packages/release/bioc/vignettes/OUTRIDER/inst/doc/OUTRIDER.pdf).

6.1 50 controls (different batches included)

load("/home/dzhang/projects/ATG7_rob_t_analysis/results/OUTRIDER/50_controls/ATG7_res_all.rda")
load("/home/dzhang/projects/ATG7_rob_t_analysis/results/OUTRIDER/50_controls/gene_count_corrected_example.rda")
load("/home/dzhang/projects/ATG7_rob_t_analysis/results/OUTRIDER/50_controls/gene_count_ods.rda")

6.1.1 Filter by expression

  • There are 12339 genes remaining after filtering by expression.
knitr::include_graphics("/home/dzhang/projects/ATG7_rob_t_analysis/results/OUTRIDER/50_controls/FPKM.png")

knitr::include_graphics("/home/dzhang/projects/ATG7_rob_t_analysis/results/OUTRIDER/50_controls/expressed_genes.png")

6.1.2 Pre-correction gene count heatmaps

knitr::include_graphics("/home/dzhang/projects/ATG7_rob_t_analysis/results/OUTRIDER/50_controls/count_cor_pre_correction.png")

knitr::include_graphics("/home/dzhang/projects/ATG7_rob_t_analysis/results/OUTRIDER/50_controls/count_gene_sample_pre_correction.png")

6.1.3 Post-correction gene count heatmaps

Here, I have included one example. In reality, there's plots like these for each ATG7 sample as I have re-run each ATG7 with the other ATG7 samples removed.

knitr::include_graphics("/home/dzhang/projects/ATG7_rob_t_analysis/results/OUTRIDER/50_controls/count_cor_post_correction_M0920.18.png")

knitr::include_graphics("/home/dzhang/projects/ATG7_rob_t_analysis/results/OUTRIDER/50_controls/count_gene_sample_post_correction_M0920.18.png")

  • It still looks like the count correlations are tagging the batch of the sequencing.

6.1.4 OUTRIDER results

  • I have used the FDR method for multiple test correction across every gene tested across all samples.
which_ATG7 <- which(colData(gene_count_ods)[["gene_name"]] == "ATG7")
ATG7_samp_ids <- colData(gene_count_ods)[["samp_id_tidy"]][which_ATG7] %>% 
  as.character()

stopifnot(all(ATG7_samp_ids %in% ATG7_res_all$sampleID))

ATG7_res_ATG7_only <- ATG7_res_all %>% 
  filter(sampleID %in% ATG7_samp_ids) %>%
  mutate(padjust = p.adjust(pValue, method = "fdr")) %>% 
  dplyr::select(Description, everything())
  • There are 1102 genes with outlier expression.
ATG7_res_ATG7_only_signif <- ATG7_res_ATG7_only %>% 
  filter(padjust <= 0.05) %>% 
  dplyr::select(Description, everything())

ATG7_res_ATG7_only_signif
  • The number of signif genes per ATG7 sample
ATG7_res_ATG7_only_signif$sampleID %>% table()
## .
## M0920.18 M0921.18 M1111.19 M1716.19 M1856.17    S2557 
##        7      341       12       17       38      687
  • ATG7 comes up as an outlier expressed gene only in M1856.17/S2557.
ATG7_res_ATG7_only_signif %>% 
  filter(Description == "ATG7")
  • Are there any genes that are outliers in common across ATG7_samples
ATG7_res_ATG7_only_signif %>% 
  dplyr::count(Description) %>% 
  filter(n > 1) %>% 
  arrange(desc(n))
  • There are no outlier genes detected in either of the control samples, which is reassuring.
ATG7_res_all %>% 
  filter(str_detect(sampleID, "control")) %>% 
  mutate(padjust =  p.adjust(pValue, method = "fdr")) %>% 
  filter(padjust <= 0.05)
  • As a double check, how many of the ATG7 outliers are common to any of the remaining patient samples.
signif_genes_remaining_samp <- ATG7_res_all %>% 
  filter(!(sampleID %in% ATG7_samp_ids)) %>% 
  mutate(padjust =  p.adjust(pValue, method = "fdr")) %>% 
  filter(padjust <= 0.05) %>% 
  .[["Description"]]

mean(unique(ATG7_res_ATG7_only_signif$Description) %in% unique(signif_genes_remaining_samp))
## [1] 0.1444867

6.1.5 Gene-set enrichment

  • Here, I have used a unranked gprofiler test of all significant genes across ATG7 with the background genes as all genes that have been tested through OUTRIDER.
hits_gprofiler <- 
  gprofiler2::gost(query = unique(ATG7_res_ATG7_only_signif %>% 
                                    arrange(padjust) %>% 
                                    .[["geneID"]]), 
                   organism = "hsapiens", 
                   ordered_query = F, 
                   significant = T, 
                   user_threshold = 0.05, 
                   correction_method = c("g_SCS"),
                   domain_scope = c("custom"),
                   custom_bg = unique(ATG7_res_all$geneID),
                   sources = c("GO", "KEGG", "REAC"), 
                   evcodes = F)

gprofiler2::gostplot(hits_gprofiler)

6.1.6 Autophagy genes

  • How many known autophagy genes that are part of the KEGG pathway are found amongst the ATG7 outliers?
library(KEGGREST)

query_kegg_pathway <- function(pathway_name, organism = "hsa"){
  
  # get all pathways associated with humans 
  all_pathways <- keggList("pathway", organism)
  
  # filter to only selected pathway and get information about it
  selected_pathway <- all_pathways[str_detect(all_pathways, pathway_name)]
  
  print(str_c("Getting info for the pathways: ", unname(selected_pathway) %>% str_c(collapse = ", ")))
  
  
  selected_pathway_info <- keggGet(names(selected_pathway))
  
  return(selected_pathway_info)
  
}

get_pathway_gene_info <- function(kegg_pathway_info, organism = "hsa"){
  
  pathway_gene_info_all <- tibble()
  
  for(i in 1:length(kegg_pathway_info)){
    
    odds <- (1:length(kegg_pathway_info[[i]]$GENE))[(1:length(kegg_pathway_info[[i]]$GENE) %% 2) != 0]
    evens <- (1:length(kegg_pathway_info[[i]]$GENE))[(1:length(kegg_pathway_info[[i]]$GENE) %% 2) == 0]
    
    pathway_gene_info <- 
      tibble(pathway_kegg_code = kegg_pathway_info[[i]]$ENTRY, 
             pathway_name = kegg_pathway_info[[i]]$NAME, 
             organism = organism, 
             kegg_gene_entry = kegg_pathway_info[[i]]$GENE[odds], 
             gene = kegg_pathway_info[[i]]$GENE[evens]) %>% 
      separate(col = "gene", into = c("gene_name", "gene_desc"), sep = "; ")
    
    pathway_gene_info_all <- pathway_gene_info_all %>% bind_rows(pathway_gene_info)
    
  }
  
  # for(i in 1:nrow(pathway_gene_info)){ 
  #   
  #   gene_info <- keggGet(str_c(organism, ":", pathway_gene_info$kegg_gene_entry)[i])
  #   pathway_gene_info$gene_identifiers[i] <- gene_info[[1]]$DBLINKS %>% str_c(collapse = ";")
  # 
  # }

  return(pathway_gene_info_all)
  
}

autophagy_kegg_info <- query_kegg_pathway(pathway_name = "Autophagy", organism = "hsa")
## [1] "Getting info for the pathways: Autophagy - other - Homo sapiens (human), Autophagy - animal - Homo sapiens (human)"
autophagy_gene_info <- get_pathway_gene_info(kegg_pathway_info = autophagy_kegg_info, organism = "hsa")

autophagy_gene_info$gene_name %>% unique()
##   [1] "RPTOR"     "MTOR"      "MLST8"     "ULK2"      "ATG13"     "ATG101"   
##   [7] "IGBP1"     "PPP2CA"    "PPP2CB"    "ATG9B"     "ATG9A"     "ATG2A"    
##  [13] "ATG2B"     "WIPI2"     "WIPI1"     "BECN1"     "BECN2"     "PIK3C3"   
##  [19] "PIK3R4"    "ATG12"     "ATG5"      "ATG16L1"   "ATG7"      "ATG10"    
##  [25] "ATG3"      "GABARAP"   "GABARAPL1" "GABARAPL2" "ATG4A"     "ATG4B"    
##  [31] "ATG4C"     "ATG4D"     "INS"       "IGF1R"     "IRS1"      "IRS2"     
##  [37] "IRS4"      "PIK3CA"    "PIK3CD"    "PIK3CB"    "PIK3R1"    "PIK3R2"   
##  [43] "PIK3R3"    "PDPK1"     "AKT1"      "AKT2"      "AKT3"      "PTEN"     
##  [49] "HRAS"      "KRAS"      "NRAS"      "MRAS"      "RRAS"      "RRAS2"    
##  [55] "RAF1"      "MAP2K1"    "MAP2K2"    "MAPK1"     "MAPK3"     "HIF1A"    
##  [61] "DDIT4"     "TSC2"      "TSC1"      "BNIP3"     "RHEB"      "DEPTOR"   
##  [67] "AKT1S1"    "RPS6KB1"   "RPS6KB2"   "ULK1"      "RB1CC1"    "SUPT20H"  
##  [73] "SMCR8"     "WDR41"     "C9orf72"   "RAB8A"     "RAB39B"    "SQSTM1"   
##  [79] "TBK1"      "TANK"      "RAB1A"     "RRAGA"     "RRAGB"     "RRAGC"    
##  [85] "RRAGD"     "STK11"     "PRKAA1"    "PRKAA2"    "PRKACA"    "PRKACB"   
##  [91] "PRKACG"    "PRKCD"     "MAPK8"     "MAPK10"    "MAPK9"     "BCL2"     
##  [97] "BCL2L1"    "BAD"       "NRBF2"     "ATG14"     "AMBRA1"    "TRAF6"    
## [103] "TP53INP2"  "VMP1"      "STX17"     "UVRAG"     "SH3GLB1"   "RUBCN"    
## [109] "CAMKK2"    "MAP3K7"    "ERN1"      "ITPR1"     "DAPK1"     "DAPK3"    
## [115] "DAPK2"     "HMGB1"     "PRAP1"     "EIF2AK3"   "EIF2AK4"   "EIF2S1"   
## [121] "MTMR3"     "MTMR4"     "MTMR14"    "ZFYVE1"    "ATG16L2"   "CFLAR"    
## [127] "RAB33B"    "PRKCQ"     "LAMP1"     "LAMP2"     "RAB7A"     "RAB7B"    
## [133] "VAMP8"     "SNAP29"    "CTSD"      "CTSL"      "CTSB"
  • RAB39B comes up as upregulared in one patient (M0921.18). Though this is different from the RAB9 suggested by Jack.
ATG7_res_ATG7_only_signif[ATG7_res_ATG7_only_signif$Description %in% unique(autophagy_gene_info$gene_name),]
  • NFE2L2 does not look to have abnormal expression.
ATG7_res_ATG7_only %>% 
  filter(Description == "NFE2L2")

6.2 20 controls (identical batch)

load("/home/dzhang/projects/ATG7_rob_t_analysis/results/OUTRIDER/20_controls/ATG7_res_all.rda")
load("/home/dzhang/projects/ATG7_rob_t_analysis/results/OUTRIDER/20_controls/gene_count_corrected_example.rda")
load("/home/dzhang/projects/ATG7_rob_t_analysis/results/OUTRIDER/20_controls/gene_count_ods.rda")

6.2.1 Filter by expression

  • There are 11854 genes remaining after filtering by expression.
knitr::include_graphics("/home/dzhang/projects/ATG7_rob_t_analysis/results/OUTRIDER/20_controls/FPKM.png")

knitr::include_graphics("/home/dzhang/projects/ATG7_rob_t_analysis/results/OUTRIDER/20_controls/expressed_genes.png")

6.2.2 Pre-correction gene count heatmaps

knitr::include_graphics("/home/dzhang/projects/ATG7_rob_t_analysis/results/OUTRIDER/20_controls/count_cor_pre_correction.png")

knitr::include_graphics("/home/dzhang/projects/ATG7_rob_t_analysis/results/OUTRIDER/20_controls/count_gene_sample_pre_correction.png")

6.2.3 Encoding dimension optimisation

plotEncDimSearch(gene_count_ods)

6.2.4 Post-correction gene count heatmaps

Here, I have included one example. In reality, there's plots like these for each ATG7 sample as I have re-run each ATG7 with the other ATG7 samples removed.

knitr::include_graphics("/home/dzhang/projects/ATG7_rob_t_analysis/results/OUTRIDER/20_controls/count_cor_post_correction_M0920.18.png")

knitr::include_graphics("/home/dzhang/projects/ATG7_rob_t_analysis/results/OUTRIDER/20_controls/count_gene_sample_post_correction_M0920.18.png")

6.2.5 OUTRIDER results

  • I have used the FDR method for multiple test correction across every gene tested across all samples.
which_ATG7 <- which(colData(gene_count_ods)[["gene_name"]] == "ATG7")
ATG7_samp_ids <- colData(gene_count_ods)[["samp_id_tidy"]][which_ATG7] %>% 
  as.character()

stopifnot(all(ATG7_samp_ids %in% ATG7_res_all$sampleID))

ATG7_res_ATG7_only <- ATG7_res_all %>% 
  filter(sampleID %in% ATG7_samp_ids) %>%
  mutate(padjust = p.adjust(pValue, method = "fdr")) %>% 
  dplyr::select(Description, everything())
  • There are 251 genes with outlier expression.
ATG7_res_ATG7_only_signif <- ATG7_res_ATG7_only %>% 
  filter(padjust <= 0.05) %>% 
  dplyr::select(Description, everything())

ATG7_res_ATG7_only_signif
  • All significant outliers are attributed to one sample.
ATG7_res_ATG7_only_signif$sampleID %>% table()
## .
## M0921.18 
##      251
  • Now ATG7 is not an outlier anymore (does not survive FDR correction). I think that this suggests using 20 samples as controls has not worked as well as ATG7 we can see in the expression plots looks to be severely underexpressed in M1856.7.
ATG7_res_ATG7_only %>% 
  filter(Description == "ATG7")
  • As a double check, how many of the ATG7 outliers are common to any of the remaining patient samples.
signif_genes_remaining_samp <- ATG7_res_all %>% 
  filter(!(sampleID %in% ATG7_samp_ids)) %>% 
  mutate(padjust =  p.adjust(pValue, method = "fdr")) %>% 
  filter(padjust <= 0.05) %>% 
  .[["Description"]]

mean(unique(ATG7_res_ATG7_only_signif$Description) %in% unique(signif_genes_remaining_samp))
## [1] 0.4780876

6.2.6 Gene-set enrichment

hits_gprofiler <- 
  gprofiler2::gost(query = unique(ATG7_res_ATG7_only_signif %>% 
                                    arrange(padjust) %>% 
                                    .[["geneID"]]), 
                   organism = "hsapiens", 
                   ordered_query = F, 
                   significant = T, 
                   user_threshold = 0.05, 
                   correction_method = c("g_SCS"),
                   domain_scope = c("custom"),
                   custom_bg = unique(ATG7_res_all$geneID),
                   sources = c("GO", "KEGG", "REAC"), 
                   evcodes = F)

hits_gprofiler$result
gprofiler2::gostplot(hits_gprofiler)

6.2.7 Autophagy genes

  • How many known autophagy genes that are part of the KEGG pathway are found amongst the ATG7 outliers?
  • RAB39B comes up as upregulared in one patient (M0921.18). Though this is different from the RAB9 suggested by Jack.
ATG7_res_ATG7_only_signif[ATG7_res_ATG7_only_signif$Description %in% unique(autophagy_gene_info$gene_name),]
  • NFE2L2 does not look to have abnormal expression.
ATG7_res_ATG7_only %>% 
  filter(Description == "NFE2L2")

7 DESeq2

  • Based on the few number of significant hits from OUTRIDER (that limits our ability to run gene-set enrichment analysis), I believe that OUTRIDER may be over conservative when looking for differentially expressed pathways. As some perturbed genes may be differentially expressed, however not to the extent of being an outlier.
  • I have run DESeq2 using all ATG7 patients as cases vs the remaining patients (~50) with the goal of finding overall pathways that are changed common to the ATG7 patients.
  • For the covariate correction, here I have used sex/batch and library size.
load(file = "/home/dzhang/projects/ATG7_rob_t_analysis/results/DESeq2/ATG7_deseq2_res.rda")

7.1 Differentially expressed genes

  • There are 12,339 differentially expressed genes.
  • ATG7 a hit, however it looks to be slightly upregulated (we know from the above that at least one patient has a lower expressed of ATG7).
load(file = "/home/dzhang/projects/ATG7_rob_t_analysis/results/DESeq2/ATG7_deseq2_res.rda")

ATG7_deseq2_res %>% 
  filter(Description == "ATG7")

7.2 Gene-set enrichment

7.2.1 ATG7 patients

  • Here, I have tried using both the ranked (based on p-value) method and also the unranked. Below are the results shown for the ranked method as these gave a more mitochondrial result.
  • We have 448 significantly dysregulated pathways. Mitochondrial pathways are the top 2 hits with other mitochondrial related pathways amongst the significant hits.
  • There are no "autophagy" pathways with these hits
load("/home/dzhang/projects/ATG7_rob_t_analysis/results/OUTRIDER/50_controls/ATG7_res_all.rda")

hits_gprofiler <- 
  gprofiler2::gost(query = ATG7_deseq2_res %>% 
                     arrange(pvalue) %>% 
                     .[["Description"]], 
                   organism = "hsapiens", 
                   ordered_query = T, 
                   significant = T, 
                   user_threshold = 0.05, 
                   correction_method = c("g_SCS"),
                   domain_scope = c("custom"),
                   custom_bg = unique(ATG7_res_all$geneID),
                   sources = c("GO", "KEGG", "REAC"), 
                   evcodes = F)

hits_gprofiler$result %>% 
  as_tibble() 
gprofiler2::gostplot(hits_gprofiler)

7.2.2 Control samples

  • The previous set of results were purely tagging the effect of batch variation. Here, I have also reun DESeq2 on the 2 COL6A samples as a negative control.
  • The results, reassuringly, look to be completely different.
load(file = "/home/dzhang/projects/ATG7_rob_t_analysis/results/DESeq2/COL6A_deseq2_res.rda")

hits_gprofiler <- 
  gprofiler2::gost(query = COL6A_deseq2_res %>% 
                     arrange(pvalue) %>% 
                     .[["Description"]], 
                   organism = "hsapiens", 
                   ordered_query = T, 
                   significant = T, 
                   user_threshold = 0.05, 
                   correction_method = c("g_SCS"),
                   domain_scope = c("custom"),
                   custom_bg = unique(ATG7_res_all$geneID),
                   sources = c("GO", "KEGG", "REAC"), 
                   evcodes = F)

hits_gprofiler$result %>% 
  as_tibble()
gprofiler2::gostplot(hits_gprofiler)